Master’s Degree in Data Science.

I am an international student from Mexico.

Filtering Columns and Rows

When using dplyr, you can filter columns and rows by using select and filter. Let’s look at an example using the Lahman baseball database. We first have to load the Lahman and dyplr packages.

library(Lahman)
library(dplyr)

Now, suppose we would like to see the homerun totals for the 1927 New York Yankees. We could run the following code:

Batting%>%
  select(playerID,yearID,teamID,HR)%>%
  filter(teamID=='NYA' & yearID=='1927')
##     playerID yearID teamID HR
## 1  beallwa01   1927    NYA  0
## 2  bengobe01   1927    NYA  0
## 3  collipa01   1927    NYA  7
## 4  combsea01   1927    NYA  6
## 5  duganjo01   1927    NYA  2
## 6  durstce01   1927    NYA  0
## 7  gazelmi01   1927    NYA  0
## 8  gehrilo01   1927    NYA 47
## 9  giardjo01   1927    NYA  0
## 10 grabojo01   1927    NYA  0
## 11  hoytwa01   1927    NYA  0
## 12 koenima01   1927    NYA  3
## 13 lazzeto01   1927    NYA 18
## 14 meusebo01   1927    NYA  8
## 15 moorewi01   1927    NYA  1
## 16 morehra01   1927    NYA  1
## 17 paschbe01   1927    NYA  2
## 18 pennohe01   1927    NYA  0
## 19 pipgrge01   1927    NYA  1
## 20 ruethdu01   1927    NYA  1
## 21  ruthba01   1927    NYA 60
## 22 shawkbo01   1927    NYA  0
## 23 shockur01   1927    NYA  0
## 24 thomamy01   1927    NYA  0
## 25  weraju01   1927    NYA  1

Examples:

Using the Lahman database i want to find all the instances wheere Yankees players have hit 40 homeruns or more (playerID,teamID,yearID,HR):

Batting%>%
  select(playerID,yearID,teamID,HR)%>%
  filter(HR >= 40 & teamID == 'NYA')
##     playerID yearID teamID HR
## 1   ruthba01   1920    NYA 54
## 2   ruthba01   1921    NYA 59
## 3   ruthba01   1923    NYA 41
## 4   ruthba01   1924    NYA 46
## 5   ruthba01   1926    NYA 47
## 6  gehrilo01   1927    NYA 47
## 7   ruthba01   1927    NYA 60
## 8   ruthba01   1928    NYA 54
## 9   ruthba01   1929    NYA 46
## 10 gehrilo01   1930    NYA 41
## 11  ruthba01   1930    NYA 49
## 12 gehrilo01   1931    NYA 46
## 13  ruthba01   1931    NYA 46
## 14  ruthba01   1932    NYA 41
## 15 gehrilo01   1934    NYA 49
## 16 gehrilo01   1936    NYA 49
## 17 dimagjo01   1937    NYA 46
## 18 mantlmi01   1956    NYA 52
## 19 mantlmi01   1958    NYA 42
## 20 mantlmi01   1960    NYA 40
## 21 mantlmi01   1961    NYA 54
## 22 marisro01   1961    NYA 61
## 23 jacksre01   1980    NYA 41
## 24 martiti02   1997    NYA 44
## 25 giambja01   2002    NYA 41
## 26 giambja01   2003    NYA 41
## 27 rodrial01   2005    NYA 48
## 28 rodrial01   2007    NYA 54
## 29 grandcu01   2011    NYA 41
## 30 grandcu01   2012    NYA 43

This is just a different way to put the dplyr code for filter and select:

Batting%>%
  filter(HR >=40 & teamID == 'NYA')%>%
  select(playerID,yearID,teamID,HR)
##     playerID yearID teamID HR
## 1   ruthba01   1920    NYA 54
## 2   ruthba01   1921    NYA 59
## 3   ruthba01   1923    NYA 41
## 4   ruthba01   1924    NYA 46
## 5   ruthba01   1926    NYA 47
## 6  gehrilo01   1927    NYA 47
## 7   ruthba01   1927    NYA 60
## 8   ruthba01   1928    NYA 54
## 9   ruthba01   1929    NYA 46
## 10 gehrilo01   1930    NYA 41
## 11  ruthba01   1930    NYA 49
## 12 gehrilo01   1931    NYA 46
## 13  ruthba01   1931    NYA 46
## 14  ruthba01   1932    NYA 41
## 15 gehrilo01   1934    NYA 49
## 16 gehrilo01   1936    NYA 49
## 17 dimagjo01   1937    NYA 46
## 18 mantlmi01   1956    NYA 52
## 19 mantlmi01   1958    NYA 42
## 20 mantlmi01   1960    NYA 40
## 21 mantlmi01   1961    NYA 54
## 22 marisro01   1961    NYA 61
## 23 jacksre01   1980    NYA 41
## 24 martiti02   1997    NYA 44
## 25 giambja01   2002    NYA 41
## 26 giambja01   2003    NYA 41
## 27 rodrial01   2005    NYA 48
## 28 rodrial01   2007    NYA 54
## 29 grandcu01   2011    NYA 41
## 30 grandcu01   2012    NYA 43

Find all instances where a player had more than 40 HR but less than 60 strikeouts(playerID,yearID,teamID,HR,SO):

Batting%>%
  filter(HR > 40 & SO<60)%>%
  select(playerID,yearID,teamID,HR,SO)
##     playerID yearID teamID HR SO
## 1  hornsro01   1922    SLN 42 50
## 2  willicy01   1923    PHI 41 57
## 3    ottme01   1929    NY1 42 38
## 4  gehrilo01   1931    NYA 46 56
## 5   ruthba01   1931    NYA 46 51
## 6  gehrilo01   1934    NYA 49 31
## 7  gehrilo01   1936    NYA 49 46
## 8  troskha01   1936    CLE 42 58
## 9  dimagjo01   1937    NYA 46 37
## 10  mizejo01   1940    SLN 43 49
## 11  mizejo01   1947    NY1 51 42
## 12 willite01   1949    BOS 43 48
## 13 kinerra01   1951    PIT 42 57
## 14 camparo01   1953    BRO 41 58
## 15 rosenal01   1953    CLE 43 48
## 16 kluszte01   1954    CIN 49 35
## 17  mayswi01   1954    NY1 41 57
## 18 kluszte01   1955    CIN 47 40
## 19 aaronha01   1957    ML1 44 58
## 20 sievero01   1957    WS1 42 55
## 21 aaronha01   1969    ATL 44 47
## 22 aaronha01   1971    ATL 47 58
## 23 thomafr04   1993    CHA 41 54
## 24 bondsba01   2002    SFN 46 47
## 25 bondsba01   2003    SFN 45 58
## 26 bondsba01   2004    SFN 45 41
## 27 pujolal01   2004    SLN 46 52
## 28 pujolal01   2006    SLN 49 50

Find all instances of Phillies in the 1970’s hitting more than 30 HR (playerID,yearID,teamID,HR):

Batting%>%
  select(playerID,yearID,teamID,HR)%>%
  filter(teamID =='PHI' & yearID>=1970 & yearID<1980 & HR>30)
##    playerID yearID teamID HR
## 1 johnsde01   1971    PHI 34
## 2 schmimi01   1974    PHI 36
## 3 luzingr01   1975    PHI 34
## 4 schmimi01   1975    PHI 38
## 5 schmimi01   1976    PHI 38
## 6 luzingr01   1977    PHI 39
## 7 schmimi01   1977    PHI 38
## 8 luzingr01   1978    PHI 35
## 9 schmimi01   1979    PHI 45

Find every instance of a player hitting more than 50 homeruns but let’s have the players with the most homeruns at the top (playerID,yearID,teamID,HR):

Batting%>%
  select(playerID,yearID,teamID,HR)%>%
  filter(HR>50)%>%
  arrange(desc(HR))
##     playerID yearID teamID HR
## 1  bondsba01   2001    SFN 73
## 2  mcgwima01   1998    SLN 70
## 3   sosasa01   1998    CHN 66
## 4  mcgwima01   1999    SLN 65
## 5   sosasa01   2001    CHN 64
## 6   sosasa01   1999    CHN 63
## 7  marisro01   1961    NYA 61
## 8   ruthba01   1927    NYA 60
## 9   ruthba01   1921    NYA 59
## 10  foxxji01   1932    PHA 58
## 11 greenha01   1938    DET 58
## 12 howarry01   2006    PHI 58
## 13 gonzalu01   2001    ARI 57
## 14 rodrial01   2002    TEX 57
## 15 wilsoha01   1930    CHN 56
## 16 griffke02   1997    SEA 56
## 17 griffke02   1998    SEA 56
## 18  ruthba01   1920    NYA 54
## 19  ruthba01   1928    NYA 54
## 20 kinerra01   1949    PIT 54
## 21 mantlmi01   1961    NYA 54
## 22 ortizda01   2006    BOS 54
## 23 rodrial01   2007    NYA 54
## 24 bautijo02   2010    TOR 54
## 25 davisch02   2013    BAL 53
## 26 mantlmi01   1956    NYA 52
## 27  mayswi01   1965    SFN 52
## 28 fostege01   1977    CIN 52
## 29 mcgwima01   1996    OAK 52
## 30 rodrial01   2001    TEX 52
## 31 thomeji01   2002    CLE 52
## 32 kinerra01   1947    PIT 51
## 33  mizejo01   1947    NY1 51
## 34  mayswi01   1955    NY1 51
## 35 fieldce01   1990    DET 51
## 36 jonesan01   2005    ATL 51

Find all instances of a player striking out less than 10 times, players have to have at lease 400 AB. Players with the smallest number of strikeouts at the top (playerID,yearID,teamID,SO,AB):

Batting%>%
  select(playerID,teamID,yearID,SO,AB)%>%
  filter(SO<10 & AB>=400)%>%
  arrange(SO)
##     playerID teamID yearID SO  AB
## 1  doyleja01    NY1   1894  3 422
## 2  seweljo01    NYA   1932  3 503
## 3  seweljo01    CLE   1925  4 608
## 4  seweljo01    CLE   1929  4 578
## 5  seweljo01    NYA   1933  4 524
## 6   wardjo01    NY1   1893  5 588
## 7  holloch01    CHN   1922  5 592
## 8  mcinnst01    CLE   1922  5 537
## 9  wanerll01    PIT   1936  5 414
## 10 wrighge01    BS1   1875  6 408
## 11 broutda01    BSN   1889  6 485
## 12 keelewi01    BLN   1894  6 590
## 13  wardjo01    NY1   1894  6 540
## 14 quinnjo02    SLN   1895  6 543
## 15 mcinnst01    BSN   1924  6 581
## 16 seweljo01    CLE   1926  6 578
## 17  wardjo01    NY1   1889  7 479
## 18 crossla01    PHI   1893  7 415
## 19 quinnjo02    SLN   1893  7 547
## 20 crossla01    PHI   1894  7 529
## 21 vaughfa01    CIN   1896  7 433
## 22 cochrmi01    PHA   1927  7 432
## 23 seweljo01    CLE   1927  7 569
## 24 traynpi01    PIT   1929  7 540
## 25 muelldo01    NY1   1956  7 453
## 26 connoro01    NY1   1885  8 455
## 27 glassja01    IN3   1887  8 483
## 28 glassja01    NY1   1890  8 512
## 29 donovpa01    PIT   1893  8 499
## 30 dungasa01    CHN   1893  8 465
## 31 pinknge01    LS3   1893  8 446
## 32 brodist01    BLN   1894  8 573
## 33 quinnjo02    SLN   1894  8 405
## 34 bierblo01    PIT   1895  8 466
## 35 crossla01    PHI   1895  8 535
## 36   hoydu01    CIN   1895  8 429
## 37 roushed01    CIN   1921  8 418
## 38 collied01    CHA   1923  8 505
## 39 collied01    CHA   1925  8 425
## 40 speaktr01    WS1   1927  8 523
## 41 cochrmi01    PHA   1929  8 514
## 42 seweljo01    NYA   1931  8 484
## 43 wanerll01    PIT   1933  8 500
## 44 verbaem01    PHI   1947  8 540
## 45 ansonca01    CHN   1883  9 413
## 46 broutda01    DTN   1887  9 500
## 47   hoydu01    WAS   1893  9 564
## 48 bierblo01    PIT   1894  9 525
## 49 broutda01    BLN   1894  9 525
## 50 milledo01    SLN   1894  9 481
## 51 keelewi01    BLN   1896  9 544
## 52 mckeaed01    CL4   1896  9 571
## 53 speaktr01    CLE   1918  9 471
## 54  dealch01    CHN   1921  9 422
## 55 mcinnst01    BOS   1921  9 584
## 56 severha01    SLA   1921  9 472
## 57  highan01    BSN   1926  9 476
## 58 summaho01    CLE   1926  9 581
## 59 seweljo01    CLE   1928  9 588
## 60  ricesa01    WS1   1929  9 616
## 61 leachfr01    NY1   1931  9 515
## 62 busched01    PHA   1945  9 416
## 63 holmeto01    BSN   1945  9 636
## 64 boudrlo01    CLE   1948  9 560
## 65 mitchda01    CLE   1952  9 511

Another way to do the same excersice where we filter first and then select. This is just to avoid putting the AB column in the results.

Batting%>%
  filter(SO<10 & AB>=400)%>%
  select(playerID,teamID,yearID,SO)%>%
  arrange(SO)
##     playerID teamID yearID SO
## 1  doyleja01    NY1   1894  3
## 2  seweljo01    NYA   1932  3
## 3  seweljo01    CLE   1925  4
## 4  seweljo01    CLE   1929  4
## 5  seweljo01    NYA   1933  4
## 6   wardjo01    NY1   1893  5
## 7  holloch01    CHN   1922  5
## 8  mcinnst01    CLE   1922  5
## 9  wanerll01    PIT   1936  5
## 10 wrighge01    BS1   1875  6
## 11 broutda01    BSN   1889  6
## 12 keelewi01    BLN   1894  6
## 13  wardjo01    NY1   1894  6
## 14 quinnjo02    SLN   1895  6
## 15 mcinnst01    BSN   1924  6
## 16 seweljo01    CLE   1926  6
## 17  wardjo01    NY1   1889  7
## 18 crossla01    PHI   1893  7
## 19 quinnjo02    SLN   1893  7
## 20 crossla01    PHI   1894  7
## 21 vaughfa01    CIN   1896  7
## 22 cochrmi01    PHA   1927  7
## 23 seweljo01    CLE   1927  7
## 24 traynpi01    PIT   1929  7
## 25 muelldo01    NY1   1956  7
## 26 connoro01    NY1   1885  8
## 27 glassja01    IN3   1887  8
## 28 glassja01    NY1   1890  8
## 29 donovpa01    PIT   1893  8
## 30 dungasa01    CHN   1893  8
## 31 pinknge01    LS3   1893  8
## 32 brodist01    BLN   1894  8
## 33 quinnjo02    SLN   1894  8
## 34 bierblo01    PIT   1895  8
## 35 crossla01    PHI   1895  8
## 36   hoydu01    CIN   1895  8
## 37 roushed01    CIN   1921  8
## 38 collied01    CHA   1923  8
## 39 collied01    CHA   1925  8
## 40 speaktr01    WS1   1927  8
## 41 cochrmi01    PHA   1929  8
## 42 seweljo01    NYA   1931  8
## 43 wanerll01    PIT   1933  8
## 44 verbaem01    PHI   1947  8
## 45 ansonca01    CHN   1883  9
## 46 broutda01    DTN   1887  9
## 47   hoydu01    WAS   1893  9
## 48 bierblo01    PIT   1894  9
## 49 broutda01    BLN   1894  9
## 50 milledo01    SLN   1894  9
## 51 keelewi01    BLN   1896  9
## 52 mckeaed01    CL4   1896  9
## 53 speaktr01    CLE   1918  9
## 54  dealch01    CHN   1921  9
## 55 mcinnst01    BOS   1921  9
## 56 severha01    SLA   1921  9
## 57  highan01    BSN   1926  9
## 58 summaho01    CLE   1926  9
## 59 seweljo01    CLE   1928  9
## 60  ricesa01    WS1   1929  9
## 61 leachfr01    NY1   1931  9
## 62 busched01    PHA   1945  9
## 63 holmeto01    BSN   1945  9
## 64 boudrlo01    CLE   1948  9
## 65 mitchda01    CLE   1952  9